Contents

File creation: July, 05th 2017
Update: July, 05th 2017

1 Data Loading & Description

Briefly, paired PBMCs (Peripheral blood mononuclear cells) and LN (Lymph Node) from patients were stained and sorted for Tfh (T Follicular Helper) and Live cells. There are three HIV+ CP off treatment and three HIV- patients.

Per patient, per sample, two Seq-Well arrays were processed. NB: each patient has the following processed array conditions: LN TFH, LN Live, PBMC TFH, & PBMC Live.

Here - we have four arrays from one HIV- CP patient - TFH041.

Obj: Re-perform the analysis - Look at the data for patient TFH041.

2 Live Cells

2.1 Raw Count

Let’s look at the distribution of UMIs and nb. of detected genes in both LN and PBMC samples for Live Cells.

For LN sample, we have an amplification biais - that’s why we use UMIs. It seems to exhibit a quite proportional capture efficient here, whereas current platforms (DropSeq, etc.) usually exhibit low and highly variable capture efficiency.

Really close nb. of reads, UMIs and detected genes are observed in both LN and PBMC samples from Live Cells.

2.2 QC and Filtering

The next step is to perform quality control and filtering (remove cells with few reads/UMIs or genes, etc.). The threshold that Sam used was 750 UMIs and 400 genes (see Slides for Pantaleo 053017.ppt file). So, let’s use the same here, and also remove genes with less than one UMI count.

As highlighted by Sam, the runs are probably a bit under sequenced.

###--- Filtering datasets and merging
#- LN
identical(colnames(LiveCells_LN_UMI), rownames(LiveCells_LN_UMI_summary)) # TRUE
LiveCells_LN_UMI_filtered <- LiveCells_LN_UMI[, which(LiveCells_LN_UMI_summary$NUM_GENES >= 400 & LiveCells_LN_UMI_summary$NUM_TRANSCRIPTS >= 750)] # 586 cells
LiveCells_LN_UMI_filtered <- LiveCells_LN_UMI_filtered[which(apply(LiveCells_LN_UMI_filtered, 1, sum) != 0), ] # 16,664 genes and 586 cells
#- PBMC
identical(colnames(LiveCells_PBMC_UMI), rownames(LiveCells_PBMC_UMI_summary)) # TRUE
LiveCells_PBMC_UMI_filtered <- LiveCells_PBMC_UMI[, which(LiveCells_PBMC_UMI_summary$NUM_GENES >= 400 & LiveCells_PBMC_UMI_summary$NUM_TRANSCRIPTS >= 750)] # 624 cells
LiveCells_PBMC_UMI_filtered <- LiveCells_PBMC_UMI_filtered[which(apply(LiveCells_PBMC_UMI_filtered, 1, sum) != 0), ] # 15,994 genes and 624 cells
#- Merging
LiveCells_UMI <- rbind.fill(as.data.frame(t(LiveCells_LN_UMI_filtered)), as.data.frame(t(LiveCells_PBMC_UMI_filtered)))
rownames(LiveCells_UMI) <- c(paste(colnames(LiveCells_LN_UMI_filtered), 'LN', sep = '_'), paste(colnames(LiveCells_PBMC_UMI_filtered), 'PBMC', sep = '_'))
dim(LiveCells_UMI) # 1,210 cells and 19,183 genes
length(unique(c(rownames(LiveCells_LN_UMI_filtered), rownames(LiveCells_PBMC_UMI_filtered)))) # 19,183 genes
LiveCells_UMI[is.na(LiveCells_UMI)] <- 0 # NAs - replace NA by 0
LiveCells_UMI <- as.data.frame(t(LiveCells_UMI))

After filtering, we have:

After merging those two datasets together, we have:

2.3 Normalization

As in Gierahn et al. (2017), for each cell, we perform library-size normalization. UMI-collapsed gene expression values for each cell barcode were scaled by the total number of transcripts and multiplied by 10,000. Scaled expression data were then natural-log transformed before analysis using Seurat.
NB: Scaling and log2-transformation actually done with Seurat.

###--- Normalization
#- Seurat object - log2-transformation & library-size normalization - done on Rhino - why? issue with R 3.4.0
seuratObj <- new('seurat', raw.data = LiveCells_UMI)
seuratObj <- Setup(seuratObj, min.cells = 0, min.genes = 0, project = 'Pantaleo', total.expr = 10000) # no need to filter - already done
#- Output
saveRDS(seuratObj, file = 'Output_070517/LiveCells_TFH041.rds')
seuratObj_LiveCells_global <- readRDS('Output_070517/LiveCells_TFH041.rds') # 1,210 cells & 19,183 genes

2.4 Seurat Analysis

Seurat analysis using the normalized dataset. It includes the removing of unwanted sources of variation, PCA, clustering and tSNE. All analyses are done as in Gierahn et al. (2017).

The single cell dataset likely contains uninteresting sources of variation. This could include not only technical noise, but batch effects, or even biological sources of variation (cell cycle stage). As suggested in Buettner et al. (2015), regressing these signals out of the analysis can improve downstream dimensionality reduction and clustering. Seurat implements a basic version of this by constructing linear models to predict gene expression based on user-defined variables. Seurat stores the z-scored residuals of these models in the scale.data slot, and they are used for dimensionality reduction and clustering. We typically regress out cell-cell variation in gene expression driven by batch (if applicable), cell alignment rate (as provided by Drop-seq tools for Drop-seq data), the number of detected molecules and mitochondrial gene expression. For cycling cells, we can also learn a cell-cycle score (as in Macosko et al. (2015)) and Regress this out as well - from Gierahn et al. (2017).
Here, we simply regress on the number of UMIs per cell as well as the percentage mitochondrial gene content.

###--- Regress out unwanted sources of variation
mito.genes <- grep("^MT-", rownames(seuratObj_LiveCells_global@data), value = TRUE)
percent.mito <- colSums(as.matrix(expm1(seuratObj_LiveCells_global@data[mito.genes, ]))) / colSums(as.matrix(expm1(seuratObj_LiveCells_global@data)))
seuratObj_LiveCells_global <- AddMetaData(seuratObj_LiveCells_global, percent.mito, 'percent.mito')
GenePlot(seuratObj_LiveCells_global, 'nUMI', 'percent.mito')
GenePlot(seuratObj_LiveCells_global, 'nUMI', 'nGene') # 0.96
#- remove unwanted sources of variation
seuratObj_LiveCells_global <- RegressOut(seuratObj_LiveCells_global, latent.vars = c('nUMI', 'percent.mito')) # remove effect of CDR (high correlation between nGene and nUMI - 0.96) and mitochondrial

After, Seurat calculates highly variable genes and focuses on these for downstream analysis.

###--- Detection of variable genes across the single cells
seuratObj_LiveCells_global <- MeanVarPlot(seuratObj_LiveCells_global, fxn.x = expMean, fxn.y = logVarDivMean, x.low.cutoff = 0.25, x.high.cutoff = 3, y.cutoff = 0.5, do.contour = FALSE)
length(seuratObj_LiveCells_global@var.genes) # 1,311 highly variable genes

We identify 1,311 variable genes with log-mean expression values greater than 0.25 and dispersion (variance/mean) greater than 0.5.

###--- PCA
#- PCA performed on the scaled data and using only highly variable genes
seuratObj_LiveCells_global <- PCA(seuratObj_LiveCells_global, pc.genes = seuratObj_LiveCells_global@var.genes)
seuratObj_LiveCells_global <- ProjectPCA(seuratObj_LiveCells_global)

We can identify markers that are strongly correlated with cellular heterogeneity - see below.

We performed 1,000 iterations of the t-SNE algorithm using a perplexity value of 50 and the first top 5 principal components. Those first top 5 PCs were chosen according to the standard deviation - see below.

###--- tSNE
seuratObj_LiveCells_global <- RunTSNE(seuratObj_LiveCells_global, dims.use = 1:5, perplexity = 50, do.fast = TRUE)

Let’s look at the tSNE plots now - perplexity of 50 and 1,000 iterations of the Barnes-Hut implementation.

We clearly confirm Sam’s results - there is probably a naming switch between Tfh sorted and Live Cells - see above. We do see two big LN and PBMC clusters - typical of Tfh sorted behavior.

When we look at the expression of some cell markers - see below.

There is only expression of T cells - switch between Tfh sorted and Live Cells - see above.

Let’s look at the clustering proposed by Seurat (Graph-based method).
Cell classification and clustering were performed using a graph-based clustering method implemented in Seurat (FindClusters R function - share nearest neighbor (SNN) modularity optimization based clustering algorithm). Briefly, it calculates k-nearest neighbors and constructs the SNN graph. Then, it optimises the modularity function to determine the best clusters. Four distinct clusters of cells were identified using the top 5 PCs with a neighbourhood size of 40 and resolution of 0.60.

###--- Clustering
seuratObj_LiveCells_global <- FindClusters(seuratObj_LiveCells_global, pc.use = 1:5, resolution = 0.60, print.output = 0, k.param = 40, save.SNN = FALSE)

We observe four clusters: two LN-Tfh sorted clusters and two PBMC-Tfh sorted clusters.

Let’s look at some specific markers found in the paper Raph sent me (06/09/17). In HIV- infected individuals, they have demonstrated a significant increase (2-3 fold) in the Tfh cells defined by the CXCR3+Tbet+ phenotype and IFN-\(γ\)/IL21 production along with a significant decrease (2-3 fold) in Tfh cells expressing CCR4, CCR6 and producing IL-4. So here, I’d like to look at the expression of those specific marker.

Hard to observe anything here.

2.5 MAST Analysis

As we do think that there was a naming switch between Tfh sorted and Live Cells - so that, here, we probably have Tfh sorted. Then, I’d like to look at DEGs between PBMC and LN samples for Tfh sorted as already done with previous patients.

Let’s get DEGs between LN and PBMC (TFH sorted cells).

###--- MAST
#- MAST object
pData <- data.frame(cellNames = colnames(seuratObj_LiveCells_global@data),
                    Condition = unlist(str_split(colnames(seuratObj_LiveCells_global@data), '_'))[seq(2, length(unlist(str_split(colnames(seuratObj_LiveCells_global@data), '_'))), 2)])
fData <- data.frame(symbol = rownames(seuratObj_LiveCells_global@data))
scaMAST <- FromMatrix(as.matrix(seuratObj_LiveCells_global@data), cData = pData, fData = fData)
scaMAST <- MAST::primerAverage(scaMAST, geneGroups = 'symbol') # Average expression values for duplicated/redundant genes
colData(scaMAST)$cngeneson <- scale(colSums(assay(scaMAST) > 0)) # CDR
rownames(scaMAST) <- rowData(scaMAST)$primerid # change row names

#- New MAST object - keep gene with freq > 0.05
scaMAST <- scaMAST[(freq(scaMAST) > 0.05), ] # threshold of the proportion of expression for each gene - 0.05
scaMAST # 4,243 genes & 1,210 cells

#- Look at gene sets of interest (Th1, Th2, Th17, Tfh and Treg)
setdiff(unique(c(Th1, Th2, Th17, Tfh, Treg)), rownames(scaMAST)) # 14 genes removed after filtering (freq > 0.05) - let's add those genes to the dataset for MAST analysis
dataNew <- as.matrix(seuratObj_LiveCells_global@data)[rownames(scaMAST), ]
dataNew <- rbind(dataNew, as.matrix(seuratObj_LiveCells_global@data)[setdiff(unique(c(Th1, Th2, Th17, Tfh, Treg)), rownames(scaMAST)), ])
dim(dataNew) # 4,257 genes & 1,210 cells
fData <- data.frame(symbol = rownames(dataNew))
identical(as.character(pData$cellNames), colnames(dataNew)) # TRUE
scaMAST <- FromMatrix(dataNew, cData = pData, fData = fData) # new MAST object
colData(scaMAST)$cngeneson <- scale(colSums(assay(scaMAST) > 0)) # CDR
rownames(scaMAST) <- rowData(scaMAST)$primerid # change row names

Single-cell gene expression data are known to be zero-inflated and bimodal, which is a feature we observe here as well (see below) - that’s why we use MAST instead of limma.

Adptative thresholding is not needed here - values tend to 0, such as Kelly’s data. It might be because we’re dealing with UMI instead of TPM values or low sequencing depth?

Let’s perform MAST.

###--- DE analysis / Hurdle model
#- zlm model
colData(scaMAST)$Condition
zlmCond <- zlm(~ Condition + cngeneson, scaMAST, method = 'bayesglm', ebayes = TRUE,
               ebayesControl = list(method = "MLE", model = "H1"),
               hook = deviance_residuals_hook) # Hurdle model with cluster & CDR - LN as reference
saveRDS(zlmCond, file = 'Output_070517/zlmCond_LiveCells_TFH041.rds') # save within .rds file
zlmCond <- readRDS('Output_070517/zlmCond_LiveCells_TFH041.rds')

#- Significant results
hypothesisCluster <- c('ConditionPBMC')
registerDoMC(2)
fcHurdle <- foreach(i=1:length(hypothesisCluster)) %dopar%
{
  summaryCond.temp <- summary(zlmCond, doLRT = hypothesisCluster[i]) # Make contrast with the selected hypothesis
  summaryDt.temp <- summaryCond.temp$datatable # Summary (data.table)
  fcHurdle.temp <- merge(summaryDt.temp[contrast == hypothesisCluster[i] & component == 'H', .(primerid, `Pr(>Chisq)`)], # Hurdle unadjusted p-value
                         summaryDt.temp[contrast == hypothesisCluster[i] & component == 'logFC', .(primerid, coef, ci.hi, ci.lo)], # log2 FoldChange coefficient
                         by='primerid')
  fcHurdle.temp[, FDR := p.adjust(`Pr(>Chisq)`, 'fdr')] # FDR adjusted p-value
  return(fcHurdle.temp)
}
names(fcHurdle) <- hypothesisCluster
registerDoMC(2)
fcHurdleSign <- foreach(i=1:length(hypothesisCluster)) %dopar%
{
  fcHurdle.temp <- fcHurdle[[i]]
  fcHurdleSign.temp <- merge(fcHurdle.temp[FDR < 0.05 & abs(coef) > log2(1.5)], # FDR 5% & FoldChange log2(1.5)
                             as.data.frame(rowData(scaMAST)), 
                             by='primerid') # keep only singificant results
  setorder(fcHurdleSign.temp, FDR) # Order according to FDR p-values
  return(fcHurdleSign.temp)
}
names(fcHurdleSign) <- hypothesisCluster 
dim(fcHurdleSign[[1]]) # 47 DEGs

#- Output
write.csv(fcHurdleSign[[1]], file = 'Output_070517/DEGs_LNvsPBMC_LiveCells_TFH041.csv')

We have 47 DEGs - FDR 5% and log2FC > log2(1.5).

We observe, as DEGs, TIGIT, ICOS and MAF. Some genes related to mitochondrion. Results in agreement with previous HIV- patient (TFH146). It is definitely Tfh sorted and not Live Cells. (Tfh cells sorted according to PD-1+ CXCR5+ and ICOS+)

Let’s look at gene sets of interest (Tfh, Th1, Th2, Th17 and Treg).

Those genes are not really expressed. It will be more interesting to compare HIV- and HIV+ patients.

3 Tfh sorted

As I’ve just highlighted that there was a naming switch between Tfh sorted and Live Cells - so, here, let’s analyze the Tfh sorted cells (so actually, the Live Cells).

3.1 Raw Count

Let’s look at the distribution of UMIs and nb. of detected genes in both LN and PBMC samples for Live Cells.

It seems to exhibit a quite proportional capture efficient here, whereas current platforms (DropSeq, etc.) usually exhibit low and highly variable capture efficiency.

We clearly observe that PBMC sample has a really low number of genes / reads and UMIs. It is in agreement with Sam’s results - the matched blood array failed, so we have no way of commenting on the actual Live sort from PBMC in this patient.

3.2 QC and Filtering

The next step is to perform quality control and filtering (remove cells with few reads/UMIs or genes, etc.). The threshold that Sam used was 750 UMIs and 400 genes (see Slides for Pantaleo 053017.ppt file). So, let’s use the same here, and also remove genes with less than one UMI count.

As highlighted by Sam, the runs are probably a bit under sequenced, and it is impossible to analyze PBMC sample here.

###--- Filtering datasets and merging
#- LN
identical(colnames(TFH_LN_UMI), rownames(TFH_LN_UMI_summary)) # TRUE
TFH_LN_UMI_filtered <- TFH_LN_UMI[, which(TFH_LN_UMI_summary$NUM_GENES >= 400 & TFH_LN_UMI_summary$NUM_TRANSCRIPTS >= 750)] # 586 cells
TFH_LN_UMI_filtered <- TFH_LN_UMI_filtered[which(apply(TFH_LN_UMI_filtered, 1, sum) != 0), ] # 15,731 genes and 727 cells
#- PBMC
identical(colnames(TFH_PBMC_UMI), rownames(TFH_PBMC_UMI_summary)) # TRUE
TFH_PBMC_UMI_filtered <- TFH_PBMC_UMI[, which(TFH_PBMC_UMI_summary$NUM_GENES >= 400 & TFH_PBMC_UMI_summary$NUM_TRANSCRIPTS >= 750)] # 624 cells
TFH_PBMC_UMI_filtered <- TFH_PBMC_UMI_filtered[which(apply(TFH_PBMC_UMI_filtered, 1, sum) != 0), ] # 3,345 genes and 12 cells

After filtering, we have:

We clearly confirm Sam’s results.

We can only analyze the LN sample here.

3.3 Normalization

As in Gierahn et al. (2017), for each cell, we perform library-size normalization. UMI-collapsed gene expression values for each cell barcode were scaled by the total number of transcripts and multiplied by 10,000. Scaled expression data were then natural-log transformed before analysis using Seurat.
NB: Scaling and log2-transformation actually done with Seurat.

###--- Normalization
#- Seurat object - log2-transformation & library-size normalization - done on Rhino - why? issue with R 3.4.0
seuratObj <- new('seurat', raw.data = TFH_LN_UMI_filtered)
seuratObj <- Setup(seuratObj, min.cells = 0, min.genes = 0, project = 'Pantaleo', total.expr = 10000) # no need to filter - already done
#- Output
saveRDS(seuratObj, file = 'Output_070517/TFH_LN_TFH041.rds')
seuratObj_TFH_global <- readRDS('Output_070517/TFH_LN_TFH041.rds') # 727 cells & 15,731 genes

3.4 Seurat Analysis

Seurat analysis using the normalized dataset. It includes the removing of unwanted sources of variation, PCA, clustering and tSNE. All analyses are done as in Gierahn et al. (2017).

The single cell dataset likely contains uninteresting sources of variation. This could include not only technical noise, but batch effects, or even biological sources of variation (cell cycle stage). As suggested in Buettner et al. (2015), regressing these signals out of the analysis can improve downstream dimensionality reduction and clustering. Seurat implements a basic version of this by constructing linear models to predict gene expression based on user-defined variables. Seurat stores the z-scored residuals of these models in the scale.data slot, and they are used for dimensionality reduction and clustering. We typically regress out cell-cell variation in gene expression driven by batch (if applicable), cell alignment rate (as provided by Drop-seq tools for Drop-seq data), the number of detected molecules and mitochondrial gene expression. For cycling cells, we can also learn a cell-cycle score (as in Macosko et al. (2015)) and Regress this out as well - from Gierahn et al. (2017).
Here, we simply regress on the number of UMIs per cell as well as the percentage mitochondrial gene content.

###--- Regress out unwanted sources of variation
mito.genes <- grep("^MT-", rownames(seuratObj_TFH_global@data), value = TRUE)
percent.mito <- colSums(as.matrix(expm1(seuratObj_TFH_global@data[mito.genes, ]))) / colSums(as.matrix(expm1(seuratObj_TFH_global@data)))
seuratObj_TFH_global <- AddMetaData(seuratObj_TFH_global, percent.mito, 'percent.mito')
GenePlot(seuratObj_TFH_global, 'nUMI', 'percent.mito')
GenePlot(seuratObj_TFH_global, 'nUMI', 'nGene') # 0.94
#- remove unwanted sources of variation
seuratObj_TFH_global <- RegressOut(seuratObj_TFH_global, latent.vars = c('nUMI', 'percent.mito')) # remove effect of CDR (high correlation between nGene and nUMI - 0.94) and mitochondrial

After, Seurat calculates highly variable genes and focuses on these for downstream analysis.

###--- Detection of variable genes across the single cells
seuratObj_TFH_global <- MeanVarPlot(seuratObj_TFH_global, fxn.x = expMean, fxn.y = logVarDivMean, x.low.cutoff = 0.25, x.high.cutoff = 3, y.cutoff = 0.5, do.contour = FALSE)
length(seuratObj_TFH_global@var.genes) # 827 highly variable genes

We identify 827 variable genes with log-mean expression values greater than 0.25 and dispersion (variance/mean) greater than 0.5.

###--- PCA
#- PCA performed on the scaled data and using only highly variable genes
seuratObj_TFH_global <- PCA(seuratObj_TFH_global, pc.genes = seuratObj_TFH_global@var.genes)
seuratObj_TFH_global <- ProjectPCA(seuratObj_TFH_global)

We can identify markers that are strongly correlated with cellular heterogeneity - see below.

We performed 1,000 iterations of the t-SNE algorithm using a perplexity value of 50 and the first top 6 principal components. Those first top 6 PCs were chosen according to the standard deviation - see below.

###--- tSNE
seuratObj_TFH_global <- RunTSNE(seuratObj_TFH_global, dims.use = 1:6, perplexity = 50, do.fast = TRUE)

We clearly confirm Sam’s results - there is probably a naming switch between Tfh sorted and Live Cells - see above. So here, we are looking to the Live Cells - LN sample

When we look at the expression of some cell markers - see below.

There is expression of B cells and T cells.

Let’s look at the clustering proposed by Seurat (Graph-based method).
Cell classification and clustering were performed using a graph-based clustering method implemented in Seurat (FindClusters R function - share nearest neighbor (SNN) modularity optimization based clustering algorithm). Briefly, it calculates k-nearest neighbors and constructs the SNN graph. Then, it optimises the modularity function to determine the best clusters. Six distinct clusters of cells were identified using the top 6 PCs with a neighbourhood size of 30 and resolution of 0.80.

###--- Clustering
seuratObj_TFH_global <- FindClusters(seuratObj_TFH_global, pc.use = 1:6, resolution = 0.80, print.output = 0, k.param = 30, save.SNN = FALSE)

We observe six clusters: one T cells cluster, one NK cells cluster and about four clusters composed of B cells.

4 Combination of Tfh sorted and Live Cells

Let’s combine Tfh sorted cells and Live Cells together.

###--- Merge
mergedData <- rbind.fill(as.data.frame(t(LiveCells_UMI)), as.data.frame(t(TFH_LN_UMI_filtered)))
rownames(mergedData) <- c(paste(colnames(LiveCells_UMI), 'LiveCells', sep = '_'), paste(colnames(TFH_LN_UMI_filtered), 'TFH', sep = '_'))
dim(mergedData) # 1,937 cells and 20,873 genes
mergedData[is.na(mergedData)] <- 0 # NAs - replace NA by 0
mergedData <- as.data.frame(t(mergedData))

After combining, we have:

4.1 Normalization

As in Gierahn et al. (2017), for each cell, we perform library-size normalization. UMI-collapsed gene expression values for each cell barcode were scaled by the total number of transcripts and multiplied by 10,000. Scaled expression data were then natural-log transformed before analysis using Seurat.
NB: Scaling and log2-transformation actually done with Seurat.

###--- Normalization
#- Seurat object - log2-transformation & library-size normalization - done on Rhino - why? issue with R 3.4.0
seuratObj <- new('seurat', raw.data = mergedData)
seuratObj <- Setup(seuratObj, min.cells = 0, min.genes = 0, project = 'Pantaleo', total.expr = 10000) # no need to filter - already done
#- Output
saveRDS(seuratObj, file = 'Output_070517/combination_TFH041.rds')
seuratObj_global <- readRDS('Output_070517/combination_TFH041.rds') # 1,937 cells & 20,873 genes

4.2 Seurat Analysis

Seurat analysis using the normalized dataset. It includes the removing of unwanted sources of variation, PCA, clustering and tSNE. All analyses are done as in Gierahn et al. (2017).

The single cell dataset likely contains uninteresting sources of variation. This could include not only technical noise, but batch effects, or even biological sources of variation (cell cycle stage). As suggested in Buettner et al. (2015), regressing these signals out of the analysis can improve downstream dimensionality reduction and clustering. Seurat implements a basic version of this by constructing linear models to predict gene expression based on user-defined variables. Seurat stores the z-scored residuals of these models in the scale.data slot, and they are used for dimensionality reduction and clustering. We typically regress out cell-cell variation in gene expression driven by batch (if applicable), cell alignment rate (as provided by Drop-seq tools for Drop-seq data), the number of detected molecules and mitochondrial gene expression. For cycling cells, we can also learn a cell-cycle score (as in Macosko et al. (2015)) and Regress this out as well - from Gierahn et al. (2017).
Here, we simply regress on the number of UMIs per cell as well as the percentage mitochondrial gene content.

###--- Regress out unwanted sources of variation
mito.genes <- grep("^MT-", rownames(seuratObj_global@data), value = TRUE)
percent.mito <- colSums(as.matrix(expm1(seuratObj_global@data[mito.genes, ]))) / colSums(as.matrix(expm1(seuratObj_global@data)))
seuratObj_global <- AddMetaData(seuratObj_global, percent.mito, 'percent.mito')
GenePlot(seuratObj_global, 'nUMI', 'percent.mito')
GenePlot(seuratObj_global, 'nUMI', 'nGene') # 0.96
#- remove unwanted sources of variation
seuratObj_global <- RegressOut(seuratObj_global, latent.vars = c('nUMI', 'percent.mito')) # remove effect of CDR (high correlation between nGene and nUMI - 0.96) and mitochondrial

After, Seurat calculates highly variable genes and focuses on these for downstream analysis.

###--- Detection of variable genes across the single cells
seuratObj_global <- MeanVarPlot(seuratObj_global, fxn.x = expMean, fxn.y = logVarDivMean, x.low.cutoff = 0.25, x.high.cutoff = 3, y.cutoff = 0.5, do.contour = FALSE)
length(seuratObj_global@var.genes) # 881 highly variable genes

We identify 881 variable genes with log-mean expression values greater than 0.25 and dispersion (variance/mean) greater than 0.5.

###--- PCA
#- PCA performed on the scaled data and using only highly variable genes
seuratObj_global <- PCA(seuratObj_global, pc.genes = seuratObj_global@var.genes)
seuratObj_global <- ProjectPCA(seuratObj_global)

We can identify markers that are strongly correlated with cellular heterogeneity - see below.

We performed 1,000 iterations of the t-SNE algorithm using a perplexity value of 50 and the first top 6 principal components. Those first top 6 PCs were chosen according to the standard deviation - see below.

###--- tSNE
seuratObj_global <- RunTSNE(seuratObj_global, dims.use = 1:6, perplexity = 50, do.fast = TRUE)

When we look at the expression of some cell markers - see below.

Same results as Sam’s results.